Optimized photorealistic audiovisual speech synthesis using active appearance modeling
Authors
Abstract
Active appearance models represent image information in terms of shape and texture parameters. This paper explains why this property makes them highly suitable for data-based 2D audiovisual text-to-speech synthesis. We elaborate on how the separation of shape and texture information can be fully exploited to create appropriate unit-selection costs and to enhance the video concatenations. The latter is particularly important, since synthetic visual speech requires a careful balance between signal smoothness and articulation strength. Several optimization strategies to enhance the quality of the synthetic visual speech are proposed. By measuring the properties of each model parameter, an effective normalization of the visual speech database becomes feasible. In addition, the visual joins can be optimized by parameter-specific concatenation smoothing. To further enhance the naturalness of the synthetic speech, a spectrum-based smoothing approach is introduced.
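The two ideas the abstract highlights, database-wide normalization of each model parameter and parameter-specific smoothing at unit joins, can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's actual implementation: the array shapes, the `smooth_join` helper, and the cross-fade scheme are all assumptions.

```python
import numpy as np

def normalize_parameters(trajectories):
    """Z-normalize each AAM parameter track using statistics
    measured over the whole visual speech database.
    trajectories: (n_frames, n_params) array."""
    mu = trajectories.mean(axis=0)
    sigma = trajectories.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant parameters
    return (trajectories - mu) / sigma, mu, sigma

def smooth_join(left, right, widths):
    """Concatenate two parameter tracks and cross-fade each parameter
    toward the boundary mean, with a parameter-specific window width
    (in frames). widths[p] == 0 leaves parameter p untouched."""
    joined = np.vstack([left, right])
    j = len(left)                    # index of the first frame of the right unit
    for p, w in enumerate(widths):
        if w < 1:
            continue
        target = 0.5 * (left[-1, p] + right[0, p])
        for i in range(max(0, j - w), min(len(joined), j + w)):
            alpha = max(0.0, 1.0 - abs(i - j + 0.5) / w)  # weight peaks at the join
            joined[i, p] = (1.0 - alpha) * joined[i, p] + alpha * target
    return joined
```

Making the window width a per-parameter quantity is the point of the sketch: parameters that mostly encode texture can be smoothed aggressively without hurting articulation, while shape parameters that carry articulation strength get a narrow window or none at all.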
Similar resources
Active appearance models for photorealistic visual speech synthesis
The perceived quality of a synthetic visual speech signal greatly depends on the smoothness of the presented visual articulators. This paper explains how concatenative visual speech synthesis systems can apply active appearance models to achieve a smooth and natural visual output speech. By modeling the visual speech contained in the system’s speech database, a diversification between the synth...
Auditory and photo-realistic audiovisual speech synthesis for Dutch
Both auditory and audiovisual speech synthesis have been the subject of many research projects throughout the years. Unfortunately, in recent years only very little research has focused on synthesis for the Dutch language. Especially for audiovisual synthesis, hardly any available system or resource can be found. In this paper we describe the creation of a new extensive Dutch speech database, containi...
Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis
The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in ...
Influence of Phone-Viseme Temporal Correlations on Audiovisual STT and TTS Performance
In this paper, we present a study of temporal correlations of audiovisual units in continuous Russian speech. The corpus-based study identifies natural time asynchronies between the audible and visible speech modalities, partially caused by the inertance of the articulation organs. Original methods for speech asynchrony modeling have been proposed and studied using bimodal ASR and TTS system...
MikeTalk: A Talking Facial Display Based on Morphing Visemes
We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods...
Journal:
Volume, Issue:
Pages: -
Publication date: 2010